Leiden clustering#

A quick introduction to Leiden clustering…#

The Leiden algorithm is a community-detection method that improves on the Louvain algorithm. It aims to identify cohesive groups, or clusters, within a larger network by optimizing the network's modularity, which measures the strength of the division into communities.

The Leiden algorithm computes a clustering on a k-nearest-neighbour (KNN) graph built from the PCA-reduced feature space. It starts by assigning each node in the network to its own cluster, which forms the initial partition. It then iterates through the nodes, moving each one to a neighbouring cluster whenever doing so improves the network's modularity. This process continues until no further improvement can be made, resulting in a refined partition of the network in which nodes in the same cluster are more densely connected to each other than to nodes in other clusters.

What sets Leiden apart is its ability to handle large networks efficiently: a refinement step avoids the exhaustive exploration of all possible node movements and focuses instead on the changes that are likely to improve modularity, and it guarantees that the resulting communities are well connected. The implementation exposes a resolution parameter that determines the scale of the partition and therefore the coarseness of the clustering: a higher resolution leads to more clusters.
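The local-moving phase described above can be sketched in a few lines of pure Python. This toy (greedy modularity-improving moves on a hand-built adjacency dict, the phase Leiden shares with Louvain) is only illustrative: it is not the `LeidenClustering` implementation used below, and it omits Leiden's refinement and aggregation phases.

```python
def modularity(adj, communities):
    """Newman modularity Q of a partition of an undirected graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    q = 0.0
    for v, nbrs in adj.items():
        for w in adj:
            if communities[v] != communities[w]:
                continue
            a = 1.0 if w in nbrs else 0.0
            q += a - degree[v] * degree[w] / (2 * m)
    return q / (2 * m)

def local_move(adj):
    """Greedy node moves starting from singleton clusters.

    Recomputes Q from scratch for each candidate move (O(n^2) per
    evaluation); real implementations use incremental delta-Q updates.
    """
    communities = {v: v for v in adj}  # each node starts in its own cluster
    improved = True
    while improved:
        improved = False
        for v in adj:
            best_c = communities[v]
            best_q = modularity(adj, communities)
            # only neighbouring clusters are candidate destinations
            for c in {communities[w] for w in adj[v]}:
                old_c = communities[v]
                communities[v] = c
                q = modularity(adj, communities)
                communities[v] = old_c
                if q > best_q:
                    best_c, best_q = c, q
            if best_c != communities[v]:
                communities[v] = best_c
                improved = True
    return communities

# Two triangles joined by a single bridge edge (2-3): expect two clusters.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(local_move(adj))
```

On this toy graph the greedy moves recover the two triangles as separate clusters; Leiden adds its refinement phase on top of such passes to guarantee that communities stay well connected.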

import pandas as pd
import numpy as np

from pathlib import Path
import utils

# dimensionality reduction
from sklearn.decomposition import PCA
from umap import umap_ as UMAP

# plotting
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()

# clustering
from leiden_clustering import LeidenClustering

Data loading and pre-processing#

DATA_PATH = Path.cwd() / "../data"
FIG_PATH = Path.cwd() / "../src"
FILTER_TIMEOUT = True
if FILTER_TIMEOUT:
    FIG_PATH = FIG_PATH / "figures/leiden/"
else:
    FIG_PATH = FIG_PATH / "figures/leiden_with_timeout/"
FIG_PATH = FIG_PATH.resolve()
print(f"Figure path: {FIG_PATH}")
data = {
    Path(f).stem: pd.read_csv(f, index_col=0) for f in DATA_PATH.glob("combined_*.csv")
}
print(list(data.keys()))
if not FIG_PATH.exists():
    FIG_PATH.mkdir(parents=True)
Figure path: C:\Users\Cyril\Desktop\Code\ada-2023-project-adamants\src\figures\leiden
['combined_metrics_finished_edges', 'combined_metrics_finished_paths', 'combined_metrics_unfinished_edges', 'combined_metrics_unfinished_paths']
features_finished_paths = data["combined_metrics_finished_paths"].reset_index(drop=True)
features_unfinished_paths = data["combined_metrics_unfinished_paths"].reset_index(
    drop=True
)

Let's filter out the timeouts this time. In the notebook leiden clustering with timeout you will see that keeping them only adds one cluster containing all the timeouts, while shifting the means of every feature that uses time, which is why we remove them here.

if FILTER_TIMEOUT:
    features_unfinished_paths = features_unfinished_paths[
        features_unfinished_paths["type"] != "timeout"
    ]
combined_df = pd.concat([features_finished_paths, features_unfinished_paths], axis=0)
combined_df["finished"] = [1] * len(features_finished_paths) + [0] * len(
    features_unfinished_paths
)
combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING] = utils.normalize_features(
    combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING]
)
combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING].isna().sum()
durationInSec             0
backtrack                 0
numberOfPath              3
position_mean             0
position_std              0
path_length               0
coarse_mean_time          0
semantic_similarity       0
ratio                     0
path_closeness_abs_sum    0
dtype: int64
combined_df.dropna(subset=utils.FEATURES_COLS_USED_FOR_CLUSTERING, inplace=True)
X = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING].copy().values
combined_df
hashedIpAddress timestamp durationInSec path rating backtrack numberOfPath path_length coarse_mean_time position_mean ... semantic_similarity path_degree_abs_sum path_clustering_abs_sum path_degree_centrality_abs_sum path_betweenness_abs_sum path_closeness_abs_sum index target type finished
0 6a3701d319fc3754 2011-02-15 03:26:49 0.450421 ['14th_century', '15th_century', '16th_century... NaN -0.292447 -0.397854 1.246512 -0.037357 0.012312 ... 1.896769 5.197422 -3.233355 -3.234431 -5.004633 -4.342440e-01 NaN NaN NaN 1
1 3824310e536af032 2012-08-12 06:36:52 -0.246765 ['14th_century', 'Europe', 'Africa', 'Atlantic... 3.0 -0.292447 0.302599 -0.127465 -0.104634 -0.645120 ... 1.187599 7.010212 -2.318889 -1.421642 -3.524030 9.426499e-01 NaN NaN NaN 1
2 415612e93584d30e 2012-10-03 21:10:40 0.247484 ['14th_century', 'Niger', 'Nigeria', 'British_... NaN -0.292447 0.483877 0.971189 -0.133470 -0.379822 ... 0.756814 5.236336 -3.212821 -3.195518 -4.962358 -4.075266e-01 NaN NaN NaN 1
3 64dd5cd342e3780c 2010-02-08 07:25:25 -1.198556 ['14th_century', 'Renaissance', 'Ancient_Greec... NaN -0.292447 -0.591540 -0.649073 -1.028104 -0.454105 ... 0.515755 NaN NaN NaN NaN -9.236796e-16 NaN NaN NaN 1
4 015245d773376aab 2013-04-23 15:27:08 0.508421 ['14th_century', 'Italy', 'Roman_Catholic_Chur... 3.0 -0.292447 -0.922649 0.659053 0.399219 0.064624 ... -0.870712 6.315864 -2.737924 -2.115989 -4.991059 5.543467e-01 NaN NaN NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18974 109ed71f571d86e9 2014-01-15 11:01:43 0.353632 ['Montenegro', 'World_War_II', 'United_States'... NaN -0.292447 0.508379 0.659053 0.196939 0.277953 ... 0.499099 6.294228 -2.251847 -2.137626 -3.529575 -1.163800e-01 24867.0 Hurricane_Georges restart 0
18975 109ed71f571d86e9 2014-01-15 11:36:08 -0.467209 ['Wine', 'Georgia_%28country%29', 'Russia'] NaN -0.292447 0.651736 -1.321543 0.340616 -1.439055 ... -1.735091 NaN NaN NaN NaN -9.236796e-16 24868.0 History_of_post-Soviet_Russia restart 0
18976 109ed71f571d86e9 2014-01-15 12:00:12 0.551507 ['Turks_and_Caicos_Islands', 'United_States', ... NaN -0.292447 0.775702 0.298719 0.676819 0.796475 ... 0.057767 7.555088 -1.505405 -0.876766 -2.304742 1.010689e+00 24869.0 Iraq_War restart 0
18977 109ed71f571d86e9 2014-01-15 12:06:45 0.539368 ['Franz_Kafka', 'Tuberculosis', 'World_Health_... NaN 0.378896 0.802247 0.298719 0.247967 -0.319231 ... -0.508621 3.669951 -3.305998 -4.761902 -7.644179 -1.910789e+00 24870.0 Cholera restart 0
18980 1cf0cbb3281049ab 2014-01-15 21:54:01 1.276127 ['Mark_Antony', 'Rome', 'Tennis', 'Hawk-Eye', ... NaN -0.292447 0.936431 -0.127465 1.885496 0.685458 ... -1.882444 5.595124 -1.730546 -2.836729 -5.768889 4.568906e-01 24874.0 Feather restart 0

63117 rows × 26 columns

We save the histograms for the website

save_folder = FIG_PATH / "histogramms"
if not save_folder.exists():
    save_folder.mkdir(parents=True)
# save histograms of the features as separate images, without background
for feature in utils.FEATURES_COLS_USED_FOR_CLUSTERING:
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.histplot(data=combined_df, x=feature, ax=ax).set(title=feature, xlabel=None, ylabel=None)
    plt.savefig(save_folder / f"{feature}_histogram.png", transparent=True)
    plt.close()

Let’s find the dimensionality of the data#

# find the dimensionality of the data
with plt.style.context("dark_background"):
    pca = PCA(n_components=X.shape[1])
    pca.fit(X)
    # plot the explained variance
    plt.scatter(range(1, X.shape[1] + 1), pca.explained_variance_ratio_, c="cyan")
    plt.xlabel("Number of components", fontsize=16)
    plt.ylabel("Explained variance", fontsize=16)
    plt.title("Percentage of explained variance\nof each component of the PCA", fontsize=20)
    plt.savefig(FIG_PATH / "pca_explained_variance.png", transparent=True)
    plt.show()

We will keep only the first 7 dimensions.
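The choice of 7 components is read off the scree plot visually; a common programmatic alternative (a sketch, not what this notebook does) is to keep the smallest number of components whose cumulative explained variance passes a target. The synthetic rank-6 data and the 0.9 threshold below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data of rank at most 6, standing in for the feature matrix X
X_demo = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 10))

Xc = X_demo - X_demo.mean(axis=0)  # PCA operates on centered data
singular_values = np.linalg.svd(Xc, compute_uv=False)
# identical to sklearn PCA's explained_variance_ratio_
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# smallest number of components reaching 90% cumulative explained variance
n_components = int(np.searchsorted(cumulative, 0.9) + 1)
print(n_components)  # at most 6, since the data has rank 6
```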

Clustering#

We will cluster the data using Leiden with a resolution parameter of 0.2 and a PCA with 7 components.

# fix seed for reproducibility
np.random.seed(2)
clustering = LeidenClustering(
    leiden_kws={"n_iterations": -1, "seed": 0, "resolution_parameter": 0.2},
    pca_kws={"n_components": 7},
)
clustering.fit(X)
print(np.unique(clustering.labels_))
c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\pynndescent\pynndescent_.py:346: NumbaPendingDeprecationWarning:

Code using Numba extension API maybe depending on 'old_style' error-capturing, which is deprecated and will be replaced by 'new_style' in a future release. See details at https://numba.readthedocs.io/en/latest/reference/deprecation.html#deprecation-of-old-style-numba-captured-errors
Exception origin:
  File "c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\numba\core\types\functions.py", line 486, in __getnewargs__
    raise ReferenceError("underlying object has vanished")


[0 1 2 3 4]

Visualization of the results#

# UMAP
umap = UMAP.UMAP(n_components=3, metric="euclidean")
result_umap_euc = umap.fit_transform(X)
UMAP plot#

fig = px.scatter_3d(
    result_umap_euc,
    x=0,
    y=1,
    z=2,
    color=clustering.labels_.astype(str),
    title="UMAP, showing Leiden clustering",
    # reduce point size
    size_max=0.1,
    category_orders={"color": [str(i) for i in range(0, len(np.unique(clustering.labels_)))]},
)
fig.update_layout(
    plot_bgcolor="#14181e",
    paper_bgcolor="#14181e",
    font_color="white",
    legend_title_text="Cluster",
    legend=dict(bgcolor="rgba(0,0,0,0)"),
    scene=dict(
        xaxis=dict(
            title="UMAP 1",
            showticklabels=False,
            backgroundcolor="rgba(0, 0, 0, 0)",
            # gridcolor="rgba(0, 0, 0, 0)",  # gridcolor is for the logo
            showbackground=True,
            zerolinecolor="white",
        ),
        yaxis=dict(
            title="UMAP 2",
            showticklabels=False,
            backgroundcolor="rgba(0, 0, 0, 0)",
            showbackground=True,
            zerolinecolor="white",
        ),
        zaxis=dict(
            title="UMAP 3",
            showticklabels=False,
            backgroundcolor="rgba(0, 0, 0, 0)",
            showbackground=True,
            zerolinecolor="white",
        ),
    ),
)

fig.write_html(FIG_PATH / "umap_leiden.html")
display(fig)

We can see that this algorithm does not simply find the clusters suggested by the UMAP projection, and that is a good thing! You can have a look at the notebook optics_DBSCAN if you want to see why…
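A quantitative way to compare two clusterings (not used in this analysis) is scikit-learn's adjusted Rand index: it is 1 for identical partitions, regardless of how the labels are numbered, and near 0 for chance-level agreement. A minimal sketch with hypothetical label vectors:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for two clustering results.
labels_leiden = [0, 0, 0, 1, 1, 2, 2, 2]
labels_same = [1, 1, 1, 0, 0, 2, 2, 2]   # same partition, labels renamed
labels_other = [0, 1, 0, 1, 0, 1, 0, 1]  # unrelated partition

print(adjusted_rand_score(labels_leiden, labels_same))   # 1.0
print(adjusted_rand_score(labels_leiden, labels_other))  # close to 0
```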

Features distributions in the clusters#

# plot histogram features colored by cluster
with plt.style.context("dark_background"):
    plot_data = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING].copy().dropna()
    plot_data["cluster"] = clustering.labels_
    plot_data["cluster"] = plot_data["cluster"].astype("category")
    n_features = len(plot_data.columns) - 1
    n_cols = 3
    n_rows = int(np.ceil(n_features / n_cols))
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))
    axs = axs.flatten()
    for i, col in enumerate(plot_data.columns):
        if col == "cluster":
            continue
        sns.histplot(
            data=plot_data,
            x=col,
            hue="cluster",
            ax=axs[i],
            stat="density",
            common_norm=False,
            palette="tab10",
        )
        utils.set_axis_style(axs, i)

    plt.suptitle("Histograms of normalized features colored by cluster", fontsize=20)
    plt.subplots_adjust(top=0.95)
    plt.show()

Boxplots of features distribution#

# plot the boxplots of features colored by cluster
with plt.style.context("dark_background"):
    combined_df["finished_normalized"] = (
        combined_df["finished"] - combined_df["finished"].mean()
    ) / combined_df["finished"].std()
    plot_data = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING + ["finished_normalized"]].copy()
    plot_data["cluster"] = clustering.labels_
    plot_data["cluster"] = plot_data["cluster"].astype("category")
    n_features = len(plot_data.columns) - 1
    n_cols = 4
    n_rows = int(np.ceil(n_features / n_cols))
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))
    fig.suptitle("Boxplots of normalized features colored by cluster", fontsize=20)
    plt.subplots_adjust(top=0.95)
    axs = axs.flatten()
    for i, col in enumerate(plot_data.columns):
        if col == "cluster":
            continue
        sns.boxplot(data=plot_data, x="cluster", y=col, ax=axs[i], hue="cluster", palette="tab10")
        utils.set_axis_style(axs, i)
    plt.show()

Heatmap of features means in each cluster#

# heatmap of features means per cluster
feat_labels = utils.get_feature_names_labels()
with plt.style.context("dark_background"):
    means = plot_data.groupby(plot_data["cluster"]).mean()
    sns.heatmap(means, cmap="coolwarm", annot=True, fmt=".1f", vmin=-1, vmax=1)
    plt.title("Heatmap of normalized features\nmeans per cluster", fontsize=25)
    plt.ylabel("Cluster", fontsize=20)
    plt.xlabel("Feature", fontsize=20)
    plt.xticks(labels=feat_labels+["Finished (norm)"], ticks=np.arange(len(feat_labels)+1)+0.5, rotation=90, fontsize=15)
    plt.show()

Number of paths per cluster#

clusters_names,n_points = np.unique(plot_data["cluster"], return_counts=True)
with plt.style.context("dark_background"):
    sns.barplot(x=clusters_names, y=n_points, hue=range(len(clusters_names)), palette="tab10").set_title("Number of paths per cluster", fontsize=20)
    # make background grey
    plt.gca().patch.set_facecolor("#d3d3d3")
    # update legend style
    leg = plt.gca().get_legend()
    leg.get_frame().set_facecolor("#d3d3d3")
    leg.set_title("Cluster")
    leg_title = leg.get_title()
    leg_title.set_color("black")
    # change all legend text to black
    [text.set_color("black") for text in leg.get_texts()]
    plt.xlabel("Cluster", fontsize=14)
    plt.ylabel("Number of paths", fontsize=14)
plt.show()

Number of restarts per cluster#

# number of restarts per cluster
with plt.style.context("dark_background"):
    plot_data["restart"] = combined_df["type"] == "restart"
    restarts_per_cluster = plot_data.groupby(plot_data["cluster"])["restart"].sum()
    sns.barplot(
        x=clusters_names,
        y=restarts_per_cluster,
        hue=range(len(clusters_names)),
        palette="tab10",
    ).set_title("Number of restarts per cluster", fontsize=20)
    plt.gca().patch.set_facecolor("#d3d3d3")
    leg = plt.gca().get_legend()
    leg.get_frame().set_facecolor("#d3d3d3")
    leg.set_title("Cluster")
    leg_title = leg.get_title()
    leg_title.set_color("black")
    [text.set_color("black") for text in leg.get_texts()]
    plt.xlabel("Cluster", fontsize=14)
    plt.ylabel("Number of restarts", fontsize=14)
plt.show()
if not FILTER_TIMEOUT:
    # number of timeouts per cluster
    plot_data["timeout"] = combined_df["type"] == "timeout"
    timeout_per_cluster = plot_data.groupby(plot_data["cluster"])["timeout"].sum()
    sns.barplot(x=clusters_names, y=timeout_per_cluster).set_title(
        "Number of timeouts per cluster"
    )
    plt.savefig(FIG_PATH / "number_of_timeouts_per_cluster.png", transparent=True)

Discussion#

Let's try to name and characterize these groups! To do so, we will mainly use the heatmap and some Welch's t-tests.#
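Welch's t-test is the two-sample t-test without the equal-variance assumption; with SciPy it amounts to `stats.ttest_ind(a, b, equal_var=False)`. A minimal sketch on synthetic stand-ins for two clusters' feature values (the samples below are illustrative assumptions, not the notebook's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# stand-ins for one normalized feature in two different clusters
cluster_a = rng.normal(loc=0.0, scale=1.0, size=200)
cluster_b = rng.normal(loc=1.0, scale=1.5, size=150)

# equal_var=False selects Welch's test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(cluster_a, cluster_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

A small p-value indicates the feature's mean genuinely differs between the two clusters, which is what we use to label the groups.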

combined_df["cluster"] = clustering.labels_
# replace NaN by "finished"
combined_df["type"] = combined_df["type"].fillna("finished")
np.random.seed(1)
sample = combined_df[combined_df["cluster"] == 2][["path", "type"]].sample(5).values
for path, type_ in sample:
    print(f"Path: {path}, type: {type_}")
Path: ['Pollution', 'Kyoto_Protocol', 'European_Union', 'Austria', 'Vienna'], type: finished
Path: ['Cod', 'Vitamin_D', 'Calcium', 'Periodic_table', 'Lithium'], type: finished
Path: ['Homestar_Runner', 'Japan', 'Asia', 'Europe', 'Republic_of_Ireland', 'Dublin'], type: finished
Path: ['Malta', 'Tunisia', 'North_Africa', 'United_Nations', 'World_Health_Organization', 'AIDS', 'HIV'], type: finished
Path: ['Citrus', 'Juice', 'Water', 'River', 'Meander'], type: finished
np.random.seed(1)
sample = combined_df[combined_df["cluster"] == 0][["path", "type"]].sample(5).values
for path, type_ in sample:
    print(f"Path: {path}, type: {type_}")
Path: ['Asteroid', 'Comet', 'Earth', 'Ocean', 'Arctic_Ocean', 'Norway', 'Viking'], type: finished
Path: ['Christian_Bale', 'Theatre', 'Aristotle', 'Ancient_Greece', 'Christianity', 'Religion', 'Mythology', 'Greek_mythology'], type: finished
Path: ['Nintendo', 'Japan', 'Temperate', 'Arctic_Circle', 'Arctic', 'Greenland', 'Snow'], type: finished
Path: ['Woodworking', 'France', 'United_Kingdom', 'Manchester_United_F.C.', 'Denis_Law'], type: finished
Path: ['Norman_Borlaug', 'United_States', 'Television', 'DVD', 'Compact_Disc'], type: finished
np.random.seed(1)
sample = combined_df[combined_df["cluster"] == 4][["path", "type"]].sample(5).values
for path, type_ in sample:
    print(f"Path: {path}, type: {type_}")
Path: ['Erbium', 'Sweden', 'World_War_I', 'Paris', 'France', 'Algeria', 'Algiers'], type: finished
Path: ['Weight_training', 'Sand', 'Dune'], type: finished
Path: ['Coffee', 'Cancer', 'DNA', 'DNA_repair'], type: finished
Path: ['Natalie_Portman', 'World_War_II', 'Flanders'], type: finished
Path: ['Pterosaur', 'Cretaceous', 'Iridium', 'Chemical_element', 'Periodic_table', 'Sodium'], type: finished
np.random.seed(1)
sample = combined_df[combined_df["cluster"] == 1][["path", "type"]].sample(5).values
for path, type_ in sample:
    print(f"Path: {path}, type: {type_}")
Path: ['Formula_One', 'People%27s_Republic_of_China'], type: finished
Path: ['Speech_synthesis', 'Austria'], type: restart
Path: ['Zionism', 'God'], type: restart
Path: ['Bird', 'People%27s_Republic_of_China', 'United_Nations', 'Germany', 'Adolf_Hitler'], type: finished
Path: ['Functional_programming', 'Scheme_programming_language'], type: restart